Bugfix in distributed GPU tests and Distributed set!
#3880
Conversation
…nto ss/fix-gpu-tests
Looks good to me!
  commands:
    - "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
  agents:
-   slurm_mem: 120G
+   slurm_mem: 8G
why?
120G is much more than we need for those tests. After some frustration because the tests were extremely slow to start, I noticed that the agents started much more quickly when requesting a smaller amount of memory. So I am deducing that the tests run on shared nodes instead of exclusive ones, and requesting fewer resources allows us to squeeze in when the cluster is busy.
good reason. might warrant a comment
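A hedged sketch of what such an explanatory comment in the pipeline YAML could look like (the keys follow the diff above; the wording is illustrative, not the actual committed comment):

```yaml
agents:
  # Request only what the tests actually need: smaller allocations appear
  # to be scheduled faster on shared nodes when the cluster is busy.
  slurm_mem: 8G
```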
Project.toml
Outdated
  [targets]
- test = ["DataDeps", "Enzyme", "SafeTestsets", "Test", "TimesDates"]
+ test = ["DataDeps", "SafeTestsets", "Test", "Enzyme", "MPIPreferences", "TimesDates"]
Was this the crucial part?
This looks good. For future generations, can you please write a little bit about what you tried and what ended up working? I can't tell if all the changes are necessary, though the end result is fairly clean. Mostly I am wondering about slurm_mem. I'm also curious why we cannot call precompile_runtime inside runtests.jl and it is necessary to call it before Pkg.test(). This has implications for the CI of other packages.
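For context, the alternative under discussion would run precompilation as a separate pipeline command before the test step. A minimal sketch, assuming the call in question is CUDA.jl's precompile_runtime (illustrative, not necessarily the exact configuration used here):

```yaml
  commands:
    # Build the GPU runtime library once, before the test run, so that
    # Pkg.test() does not have to trigger it inside the test process.
    - "srun julia -O0 --color=yes --project -e 'using CUDA; CUDA.precompile_runtime()'"
    - "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
```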
I think it is equivalent. I am trying to precompile inside the runtests. By the way, having access to GPU distributed tests again highlighted a bug related to distributed architectures, specifically for set!.
There are two distinct issues with the GPU tests.
Issue (1) is solved but, unfortunately, some tests fail stochastically because the CUDA runtime is not found.
Ok, with some fiddling, CUDA seems to be found correctly now. I think this is the relevant change: Line 81 in 908b31a
However, I have added a failsafe option following this suggestion (which covers the error we were encountering).
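For reference, MPIPreferences (added to the test target above) is the standard way to point MPI.jl at a system MPI build, which is what a CUDA-aware setup on a cluster requires. A minimal sketch; the call below writes the choice to LocalPreferences.toml and is illustrative, not necessarily the exact failsafe adopted here:

```julia
using MPIPreferences

# Switch MPI.jl from its bundled artifact to the system MPI library
# (e.g. a CUDA-aware build on the cluster). This records the choice
# in LocalPreferences.toml, which is picked up on the next session.
MPIPreferences.use_system_binary()
```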
.buildkite/distributed/pipeline.yml
Outdated
@@ -51,7 +51,7 @@ steps:
  commands:
    - "srun julia -O0 --color=yes --project -e 'using Pkg; Pkg.test()'"
  agents:
-   slurm_mem: 8G
+   slurm_mem: 50G
👀
We can probably reduce the memory usage of the tests, right? I think a bigger grid than needed is often used.
Right, the unit tests do not require much memory. I have seen that 32G was not enough for the regression tests on the GPU, though.
they might be too big
…nto ss/fix-gpu-tests
This PR modifies the configuration of the distributed pipeline that runs on the Caltech cluster to allow using CUDA-aware MPI.
closes #3897